Abstract

Despite their biological importance, a significant number of genes for secondary metabolite biosynthesis (SMB) remain undetected, largely because they are highly diverse and are not expressed under a variety of cultivation conditions. Several software tools, including SMURF and antiSMASH, have been developed to predict fungal SMB gene clusters by finding core genes encoding polyketide synthase, nonribosomal peptide synthetase and dimethylallyltryptophan synthase as well as several others typically present in the cluster. In this work, we have devised a novel comparative genomics method to identify SMB gene clusters that is independent of motif information of the known SMB genes. The method detects SMB gene clusters by searching for a similar order of genes and their presence in nonsyntenic blocks. With this method, we were able to identify many known SMB gene clusters with the core genes in the genomic sequences of 10 filamentous fungi. Furthermore, we have also detected SMB gene clusters without core genes, including the kojic acid biosynthesis gene cluster of Aspergillus oryzae. By varying the detection parameters of the method, a significant difference in the sequence characteristics was detected between the genes residing inside the clusters and those outside the clusters.
1. Introduction

Secondary metabolites are an important resource for bioactive compounds, including lead compounds for new drugs, effective components of functional foods and chemical raw materials. Although a variety of secondary metabolites have been discovered, primarily from actinomycetes, fungi and plants, a significantly larger number are thought to remain undetected because the corresponding biosynthesis genes are silent under the conditions used for screening.

The genes responsible for the biosynthesis of each secondary metabolite are often clustered in the genome. Furthermore, the basic structures of the known secondary metabolites are often synthesized by the so-called core genes, which encode polyketide synthase (PKS), nonribosomal peptide synthetase (NRPS) and dimethylallyltryptophan synthase (DMAT). Thus, BLAST and Pfam searches for domains in the polypeptides encoded by these genes have served as powerful means of identifying essential genes in secondary metabolism biosynthesis (SMB) gene clusters. ClustScan and CLUSEAN identify core genes for SMB by searching for the functional domains and motifs of PKS and NRPS. Other software tools, such as SMURF and antiSMASH, first identify the core genes using their motifs and then extend the cluster to flanking genes with homology to genes frequently found in the known SMB gene clusters, including hydroxylases, oxidases, methylases, transcription factors (typically Zn(II)2Cys6 binuclear cluster types) and transporter genes. However, some SMB gene clusters, such as the oxylipin and kojic acid biosynthesis gene clusters, lack core genes. These examples indicate the importance of devising a method for the prediction of SMB gene clusters without using the known motifs of the core genes.

Recently, the development of next-generation sequencing technology has dramatically accelerated the sequencing of the genomes of diverse organisms. Even the genomes of filamentous fungi, which are relatively large among microbes, can be accurately sequenced without reference genomes. The extremely high throughput and low cost of sequencing have increased the motivation to sequence the genomes of closely related species, and even strains of the same species, for detailed and comprehensive genome comparisons.

In this study, we developed a novel method that applies a comparative genomics approach to predict SMB gene clusters, including those without core genes. This method depends on two characteristics of secondary metabolism genes, namely that they are highly enriched in nonsyntenic blocks and that they are rarely orthologous, even between clusters producing similar compounds, because of their generally high sequence diversity. Our method successfully predicted SMB gene clusters without using motif information from known genes in the SMB gene clusters. Through the optimization of the prediction parameters, we have also characterized the structural features of the SMB gene clusters.

2. Materials and methods 

2.1. Genome data

The nucleotide sequences of the genomes and the amino acid sequences of the deduced coding sequences were retrieved from the following databases: Aspergillus flavus (accession no. EQ963472~EQ963493) and A. oryzae (accession no. AP007150~AP007177) from the DDBJ/EMBL/GenBank DNA database; A. fumigatus, A. nidulans and A. terreus from the Aspergillus comparative database (http://www.broadinstitute.org/annotation/genome/aspergillus_group/MultiHome.html); Magnaporthe grisea from the Magnaporthe comparative database (http://www.broadinstitute.org/annotation/genome/magnaporthe_comparative/MultiHome.html); Chaetomium globosum from the Chaetomium globosum database (http://www.broadinstitute.org/annotation/genome/chaetomium_globosum); and Fusarium graminearum, F. oxysporum and F. verticillioides from the Fusarium comparative database (http://www.broadinstitute.org/annotation/genome/fusarium_group/MultiHome.html) at the Broad Institute. GenBank gene IDs were assigned to the genes annotated by the Broad Institute using BLASTP searches.

2.2. Algorithm overview 

The method for predicting gene clusters devised in this study consists of three steps. The first step is to search for pairwise similarity between the genes in the two genomes and to detect successive alignments of homologous genes (Fig. 1a and b). This step is based on the assumption that SMB gene clusters producing compounds that are not identical but share a common basic structure have similar member genes. Low gap and mismatch penalties allow the detection of a gene cluster pair containing inversions and/or deletions among their member genes. The second step is to correct the boundaries of the predicted gene cluster. This step is achieved by scoring homologous genes, considering genes outside but proximal to the predicted gene cluster (Fig. 1c and d). The third step is to enrich for gene clusters that are more likely to function as SMB gene clusters via synteny analysis (Fig. 1e). Secondary metabolism genes are highly enriched in nonsyntenic blocks when the A. oryzae genome is compared with the genome of A. fumigatus or A. nidulans. Thus, of the gene clusters predicted in the prior step, those forming syntenic blocks can be eliminated (Fig. 1e).

2.3. Identification of homologous and orthologous 
gene pairs 

Prior to comparing the order of genes between a pair of genomes, homologous gene pairs were identified in an amino acid homology search (Fig. 1a) using BLASTP with e-value thresholds (Param1) of 1.0e-5, 1.0e-10, 1.0e-15, 1.0e-30 or 1.0e-50. Orthologs were determined using the bidirectional best BLASTP hit method.
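
The two definitions above can be illustrated with a minimal Python sketch. It assumes that the BLASTP results have already been parsed into `(query_id, subject_id, e_value)` tuples; that layout and the function names `homolog_pairs`, `best_hits` and `bidirectional_best_hits` are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Set, Tuple

# One BLASTP hit as (query_id, subject_id, e_value). This tuple layout is an
# assumption of the sketch, not the file format used in the study.
Hit = Tuple[str, str, float]


def homolog_pairs(hits: Iterable[Hit], e_threshold: float) -> Set[Tuple[str, str]]:
    """All (query, subject) pairs whose e-value passes the Param1 threshold."""
    return {(q, s) for q, s, e in hits if e <= e_threshold}


def best_hits(hits: Iterable[Hit], e_threshold: float) -> Dict[str, str]:
    """Map each query to its lowest-e-value subject among the passing hits."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, e_value in hits:
        if e_value > e_threshold:
            continue  # hit fails the e-value threshold
        if query not in best or e_value < best[query][1]:
            best[query] = (subject, e_value)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, e_threshold=1.0e-10):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    a_best = best_hits(a_vs_b, e_threshold)
    b_best = best_hits(b_vs_a, e_threshold)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}
```

In this sketch a gene may have several homologs but at most one ortholog, which is the distinction the later synteny analysis relies on.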

2.4. Identification of the seed region pair 
for a gene cluster 

Figure 1. Overview of the prediction method for SMB gene clusters. (a) Homology search. Broken lines represent homologous gene pairs between two genomes; each pair of genes joined by a broken line represents a homolog. $x_i$ and $y_j$ represent genes in the first and the second genomes, respectively. (b) Local alignment. The genes were aligned in the genome using the Smith-Waterman algorithm (Param2 = -1). The pairs of contiguous genes aligned between genome 1 and genome 2 represent an example identified as a seed for predicting a gene cluster ($R_0$ or other seed regions). (c) The seed was extended until the prescribed length (Param3 = 35). The symbols $l_x$ and $l_y$ represent the numbers of genes added to the seed region of the first and the second genomes, respectively. X and Y represent extended clusters in the first and the second genomes, respectively. (d) The boundaries were corrected (Param4 = -1), and a pair of candidate gene clusters was identified. The symbols $i_{\mathrm{begin}}$ and $i_{\mathrm{end}}$ represent the locations of the genes at the beginning and end, respectively, of the cluster in the first genome. The symbols $j_{\mathrm{begin}}$ and $j_{\mathrm{end}}$ represent the corresponding gene locations in the second genome. The CB value is the sum of the maximum scores for the upstream and the downstream boundaries of a predicted cluster. The integers are indicated as an example for the particular alignment of clusters represented in this figure. (e) Synteny analysis was performed to distinguish the SMB gene cluster from the syntenic block (SB). The SB, a subset of X and Y, represents a set of genes aligned to create a contiguous block of orthologous gene pairs located within the defined distance between neighboring genes (Param5 = 10 kb). The above parameters are examples and not necessarily those used for the actual analyses.

The regions for which the order of genes was conserved between the genome pair were searched by local alignment of the genes with the Smith-Waterman algorithm (Fig. 1b). The genes in the first and second genomes were defined as $x_i$ ($i = 1, 2, \ldots, I$) and $y_j$ ($j = 1, 2, \ldots, J$), respectively. A matrix (SW) of $(J + 1) \times (I + 1)$ was prepared by calculating each cell score according to the following formulas:

$$SW(j, i) = SW(j-1, i-1) + 1$$

when a pair of genes, $x_i$ and $y_j$, are homologous, and

$$SW(j, i) = \max\left\{\begin{array}{l} SW(j-1, i-1) + P_{\mathrm{mismatch}} \\ SW(j, i-1) + P_{\mathrm{gap}} \\ 0 \end{array}\right.$$

when $x_i$ and $y_j$ are not homologous.

Values of -0.1, -0.2, -0.4, -0.5 or -1 were used for $P_{\mathrm{gap}}$ and $P_{\mathrm{mismatch}}$, a gap and a mismatch penalty, respectively (Param2). After the scores were calculated for all of the cells in the matrix based on the similarity between any gene pair, the gene cluster coordinates were obtained by tracing the cells from the pair with the maximum score to that with a score of 0 (Fig. 1b). The pair of gene cluster coordinates was defined as $R_0$, which was used as one of the seeds for the predicted gene clusters:

$$R_0 = \{(j_1, i_1), (j_2, i_2), \ldots, (j_m, i_n)\}, \quad \text{where } j_1 < j_2 < \cdots < j_m,\ i_1 < i_2 < \cdots < i_n$$

Other seeds were detected by a traceback of the same score matrix (see Supplementary Fig. S1). These seeds were subjected to the correction of boundaries in the next step. Gene cluster coordinates were also searched using the reverse orientation for one of the two genomes.
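
The seed search can be sketched in Python as a Smith-Waterman alignment over gene order rather than over residues. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `find_seed` and its arguments are hypothetical, the gap move in the second genome (`"up"`) is added here for symmetry and is an assumption, only the single best seed is traced back, and the reverse-orientation search is omitted.

```python
from typing import List, Set, Tuple


def find_seed(n_first: int, n_second: int, homologs: Set[Tuple[int, int]],
              p_gap: float = -0.2, p_mismatch: float = -0.2):
    """Return (best score, seed), where seed lists homologous (j, i) pairs.

    Genes are numbered from 1. `homologs` holds (i, j) pairs meaning that
    gene x_i of the first genome and gene y_j of the second are homologous.
    """
    sw = [[0.0] * (n_first + 1) for _ in range(n_second + 1)]
    move = [["stop"] * (n_first + 1) for _ in range(n_second + 1)]
    best, best_cell = 0.0, (0, 0)
    for j in range(1, n_second + 1):
        for i in range(1, n_first + 1):
            if (i, j) in homologs:
                score, step = sw[j - 1][i - 1] + 1.0, "diag"
            else:
                score, step = max(
                    (sw[j - 1][i - 1] + p_mismatch, "diag"),
                    (sw[j][i - 1] + p_gap, "left"),
                    (sw[j - 1][i] + p_gap, "up"),  # assumed symmetric gap move
                    (0.0, "stop"),
                    key=lambda option: option[0],
                )
            sw[j][i], move[j][i] = score, step
            if score > best:
                best, best_cell = score, (j, i)

    # Trace back from the best cell until the score drops to 0,
    # keeping only the cells that correspond to homologous gene pairs.
    seed: List[Tuple[int, int]] = []
    j, i = best_cell
    while j > 0 and i > 0 and sw[j][i] > 0:
        step = move[j][i]
        if step == "diag":
            if (i, j) in homologs:
                seed.append((j, i))
            j, i = j - 1, i - 1
        elif step == "left":
            i -= 1
        else:  # "up"
            j -= 1
    seed.reverse()
    return best, seed
```

With a mild penalty such as -0.2, one non-homologous gene inserted between homologs lowers the score only slightly, so the seed spans the insertion instead of being split.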



2.5. Correcting gene cluster boundaries 

$R_0$ and other seed regions, consisting of the genes $x_i$ ($i_1 \le i \le i_n$) and $y_j$ ($j_1 \le j \le j_m$), may not have the correct boundaries as a gene cluster for various reasons, particularly when part of the cluster has a reversed gene order. This reversal could be caused by a small inversion.

In this study, taking the experimentally confirmed sizes of known clusters into consideration, the minimum number of homologous gene pairs was set to 3, and the values 15, 25, 35, 45, 55 and 65 were used for the maximum number of genes contained in a gene cluster (Param3).

After the detection of seeds with cluster sizes under the threshold, the same number of genes located in the vicinity of the seeds was added to both ends of the seeds to extend the cluster size to a predefined number of genes for successive boundary corrections (Fig. 1c). If the number of genes to be added was odd, an additional gene was added to either of the two ends of the cluster. In this study, the same values as for Param3 were used as the cluster sizes after the addition. When Param3 was set to 35 genes, the sets of genes that extended the seed in the first and the second genomes were defined as $X$ and $Y$, respectively, and the numbers of genes added to each end of the seed were defined as $l_x$ and $l_y$, respectively:

$$X = \{x_i \mid i \text{ integer},\ i_1 - l_x \le i \le i_n + l_x,\ \text{where } i_n - i_1 + 2l_x + 1 = 35\}$$

$$Y = \{y_j \mid j \text{ integer},\ j_1 - l_y \le j \le j_m + l_y,\ \text{where } j_m - j_1 + 2l_y + 1 = 35\}$$
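
The extension step amounts to simple index arithmetic, sketched below in Python. The function name `extend_seed` is hypothetical, the extra gene in the odd case is placed on the downstream end as an arbitrary choice (the text allows either end), and clipping at contig ends is ignored.

```python
from typing import Tuple


def extend_seed(first: int, last: int, target_size: int = 35) -> Tuple[int, int]:
    """Extend a seed spanning genes first..last (inclusive) to target_size genes.

    Returns the inclusive (start, end) gene indices of the extended region.
    """
    seed_size = last - first + 1
    to_add = max(target_size - seed_size, 0)
    upstream = to_add // 2
    downstream = to_add - upstream  # receives the extra gene when to_add is odd
    return first - upstream, last + downstream
```

For a five-gene seed and a target of 35 genes, 15 genes are added on each side, which matches $i_n - i_1 + 2l_x + 1 = 35$ with $l_x = 15$.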

To correct the boundaries of a seed of clustered genes, homologous genes were scored from the gene located at the center of the cluster to both ends of the cluster. A score, SC, was calculated for each gene member in $X$ according to the following formulas (Fig. 1d):

$$SC(i) = \begin{cases} 1, & i = (i_1 + i_n)/2 \\ SC(i+1) + 1, & i < (i_1 + i_n)/2 \\ SC(i-1) + 1, & i > (i_1 + i_n)/2 \end{cases} \qquad (1)$$

when $x_i$ has at least one homolog among the members of $Y$, and

$$SC(i) = \begin{cases} P_{\mathrm{negative}}, & i = (i_1 + i_n)/2 \\ SC(i+1) + P_{\mathrm{negative}}, & i < (i_1 + i_n)/2 \\ SC(i-1) + P_{\mathrm{negative}}, & i > (i_1 + i_n)/2 \end{cases} \qquad (2)$$

when $x_i$ has no homologs among the members of $Y$.

$P_{\mathrm{negative}}$ represents a penalty score for a gene that has no homologs in the paired extended seed. Based on the scores for all of the member genes, a gene cluster candidate was defined between the genes ($i_{\mathrm{begin}}$ and $i_{\mathrm{end}}$) with the maximum scores in the regions indicated by (1-4), respectively. To evaluate the similarity between a pair of detected clusters, a CB value was defined as the sum of the maximum scores at both ends. The boundary correction for $Y$ was determined in the same manner. Consequently, a pair of gene cluster candidates, $x_i$ and $y_j$, was defined as follows:

$$x_i\ (i_{\mathrm{begin}} \le i \le i_{\mathrm{end}}), \quad y_j\ (j_{\mathrm{begin}} \le j \le j_{\mathrm{end}})$$

In this study, the values -0.1, -0.2, -0.3, -0.4, -0.5 and -1 were used as the negative penalty (Param4).
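
The boundary correction can be sketched as a running score that grows outward from the center gene. The Python sketch below makes several assumptions that are not stated in the text: the extended region is given as a list of booleans (whether each gene has a homolog in the paired region), the center is passed as an integer index, and ties between equal maxima are resolved in favor of the gene nearest the center. The function name `correct_boundaries` is hypothetical.

```python
from typing import List, Tuple


def correct_boundaries(has_homolog: List[bool], center: int,
                       p_negative: float = -0.3) -> Tuple[int, int, float]:
    """Return (begin, end, CB) for one extended region.

    has_homolog[k] is True when gene k has at least one homolog in the
    paired extended region; `center` is the index of the seed midpoint.
    """
    n = len(has_homolog)
    gain = [1.0 if h else p_negative for h in has_homolog]
    sc = [0.0] * n
    sc[center] = gain[center]
    for i in range(center - 1, -1, -1):  # upstream of the center
        sc[i] = sc[i + 1] + gain[i]
    for i in range(center + 1, n):       # downstream of the center
        sc[i] = sc[i - 1] + gain[i]
    # Boundaries sit at the maximum score on each side of the center.
    begin = max(range(center, -1, -1), key=lambda i: sc[i])
    end = max(range(center, n), key=lambda i: sc[i])
    return begin, end, sc[begin] + sc[end]
```

Because the score depends only on whether each gene has a homolog somewhere in the paired region, this step ignores gene order, which is the property discussed later for Param4.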



2.6. Synteny analysis 

Secondary metabolism genes are highly enriched in nonsyntenic blocks. Secondary metabolism genes, which have high sequence diversity in general, are rarely orthologous in the comparison of genomes between two species. In contrast, syntenic blocks, in which genes shared across species commonly accumulate, have a high proportion of orthologs. Thus, candidate gene clusters that have a high probability of containing secondary metabolism biosynthesis genes can be selected by referring to their localization in nonsyntenic blocks (Fig. 1e).

Orthologous gene pairs between X and Y were aligned to create contiguous blocks until no more orthologs were identified within the threshold range of the intergenic distances in both genomes (Fig. 1e). Contiguous blocks composed of at least two orthologs were defined as syntenic blocks (SBs). Non-orthologous genes inserted between orthologs were allowed within the threshold of an intergenic distance of 5, 10, 20, 30, 40 or 50 kb (Param5).

SBs were subsets of the extended seeds, X and Y. If the percentage of the member genes in the SBs relative to the number of genes in the entire extended cluster was less than the threshold, the corresponding candidate gene cluster was selected as a predicted secondary metabolism gene cluster. In this study, 10, 15, 20, 25, 30 and 35% were used as the thresholds for the SB percentage (Param6). Multiple predicted clusters overlapping each other were merged into a single cluster, similarly to methods used in other SMB gene cluster prediction software.
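
The synteny filter can be sketched in Python as follows. The sketch is a simplification under stated assumptions: genes are given as `(start, end)` coordinates in bp, ortholog pairs are chained in the gene order of the first genome only, and the SB size is counted as the number of genes of X spanned by each block. The names `gap`, `syntenic_blocks` and `is_predicted_cluster` are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[int, int]  # (start, end) in bp; an assumed representation


def gap(a: Gene, b: Gene) -> int:
    """Distance in bp between two genes (negative if they overlap)."""
    return max(a[0], b[0]) - min(a[1], b[1])


def syntenic_blocks(orthologs: List[Tuple[int, int]], x_genes: List[Gene],
                    y_genes: List[Gene], max_gap_bp: int = 10_000):
    """Chain ortholog pairs (x_index, y_index) into blocks of two or more."""
    blocks, current = [], []
    for pair in sorted(orthologs):
        if current:
            prev_x, prev_y = current[-1]
            close = (gap(x_genes[prev_x], x_genes[pair[0]]) <= max_gap_bp
                     and gap(y_genes[prev_y], y_genes[pair[1]]) <= max_gap_bp)
            if not close:
                if len(current) >= 2:
                    blocks.append(current)
                current = []
        current.append(pair)
    if len(current) >= 2:
        blocks.append(current)
    return blocks


def is_predicted_cluster(blocks, n_genes_in_x: int,
                         max_sb_percent: float = 25.0) -> bool:
    """Keep the candidate when the SB share of X is below the Param6 threshold."""
    sb_genes = sum(block[-1][0] - block[0][0] + 1 for block in blocks)
    return 100.0 * sb_genes / n_genes_in_x < max_sb_percent
```

A candidate dominated by orthologs in conserved order is therefore rejected, while one with few or scattered orthologs is retained as a predicted cluster.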
3. Results and discussion

3.1. Effect of each parameter on the prediction

To detect SMB gene cluster candidates, the genome sequences of 10 species of filamentous fungi, including A. oryzae (see Materials and methods), were subjected to a comprehensive pairwise comparison, excluding comparisons of a genome with itself. We first detected known SMB gene clusters from the genomes of A. flavus and A. fumigatus to optimize the parameters of our method. The clusters that the method predicted for the biosynthesis of aflatoxin and gliotoxin from A. flavus and for the biosynthesis of ergot, epipolythiodioxopiperazine-type toxin (ETP), fumitremorgin, gliotoxin, melanin, Pes1, pseurotin and siderophore from A. fumigatus are listed in Table 1 and were analyzed for differences in boundary positions relative to the experimentally confirmed clusters. The absolute values of the differences in gene numbers at the upstream and the downstream boundaries were summed to generate a value defined as the prediction error. The minimum error was obtained from all of the clusters predicted for each gene cluster, and the average of the minimum errors for the 10 gene clusters from A. flavus and A. fumigatus described above was then calculated at each value of the parameters. As shown in Fig. 2, a combination including Param1 = e-10, Param2 = -0.2 and Param4 = -0.3 gave the smallest errors for the prediction of the cluster boundaries. Param3 (extension length), Param5 (intergenic distance) and Param6 (permissible ratio of syntenic blocks) had little influence on the prediction of gene clusters within the range used in this study. Consequently, Param3 = 35 genes, Param5 = 10 kb and Param6 = 25% were used.
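
The error measure used for parameter optimization is simple enough to state as code. In the Python sketch below, clusters are represented as inclusive `(first_gene, last_gene)` index pairs, which is an assumed representation, and the function names are hypothetical.

```python
from typing import List, Tuple

Cluster = Tuple[int, int]  # inclusive (first_gene, last_gene) indices


def prediction_error(predicted: Cluster, known: Cluster) -> int:
    """Sum of the absolute boundary differences at both ends, in genes."""
    return abs(predicted[0] - known[0]) + abs(predicted[1] - known[1])


def minimum_error(predictions: List[Cluster], known: Cluster) -> int:
    """Smallest error among all predictions made for one known cluster."""
    return min(prediction_error(p, known) for p in predictions)


def average_minimum_error(cases: List[Tuple[List[Cluster], Cluster]]) -> float:
    """Average of the minimum errors over several known clusters."""
    errors = [minimum_error(predictions, known) for predictions, known in cases]
    return sum(errors) / len(errors)
```

Evaluating `average_minimum_error` once per candidate parameter value gives the kind of curve summarized in Fig. 2.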

To evaluate the performance of our method, we detected known SMB gene clusters using the genomes of 10 filamentous fungal species. Of the 24 gene clusters that have been identified to date, together with their corresponding products (Supplementary Table S1), 21 gene clusters were successfully detected with the optimized parameters described above (Table 1). The minimum and the maximum errors among all of the predicted gene clusters and the error for the cluster with the maximum CB value are also indicated. Figure 3 shows the effects of Param1, Param2 and Param4 on the number of known SMB gene clusters that were detected within a minimum error of 10 genes. The number of clusters increased as the stringency of Param1 and Param2 decreased, simply because of the increased sensitivity for seed detection. A similar increase in the detected clusters was observed in the detection performed with the A. fumigatus genome regardless of whether the clusters were previously known (Fig. 4a and b). In contrast, decreasing the stringency of Param4 resulted in a decrease in the number of detected clusters (Fig. 3c). This result was also observed for the detection of clusters with fewer member genes, i.e. <10 (Fig. 4c). Some gene clusters located within a short distance of each other in the genome were predicted as a single merged cluster when a low stringency for Param4 was given. Accordingly, this low stringency led to an increase in the number of clusters with large cluster sizes (Fig. 4c).

Although Param2 and Param4 are both penalties for the alignment of homologous genes, the former takes the order of the genes into consideration, but the latter does not. A decrease in the number of predicted clusters with a stringent Param2 and little change in the number with Param4, as shown above, indicated that a pair of gene clusters has similar gene contents in terms of sequence similarity, although the order of the genes might be partially rearranged, such as by inversion.

3.2. Detailed analysis of successful and failed 
predictions of known gene clusters 

Of the 21 known SMB gene clusters predicted using our method (Table 1), some were predicted by comparing two genomes belonging to different genera. For example, the gene clusters for the biosynthesis of aflatoxin in A. flavus and fumonisin in F. verticillioides were predicted by comparison with the M. grisea and A. fumigatus genomes, respectively. The SMB gene clusters appear to be composed of genes with common sequence characteristics, even between genomes from phylogenetically distant species.

Despite the high probability of detecting the known SMB gene clusters described above, the detection of clusters for Pes1, fusaric acid and asperthecin failed. The Pes1 and asperthecin biosynthesis gene clusters consisted of only two and three genes, respectively, and had little or no chance of having conserved homologous pairs longer than three genes in the same order in the genome. The fusaric acid biosynthesis gene cluster, which contains a total of five genes, included three genes that had unique sequences. For the reasons given above, ~12.5% of the known SMB gene clusters (3 of 24) are thought to remain unpredicted. The kojic acid biosynthesis gene cluster, which consisted of three genes with only weak similarity to the genes sequenced to date, was successfully detected, although its cluster size was overestimated. It is thought that the existence of genes adjacent to the cluster with a high similarity to genes of a distantly related gene cluster led to the successful detection of this short gene cluster (Table 1).

^a Species used for comparison when the gene cluster was detected with minimum error.
^b Number of predicted clusters in a comprehensive pairwise comparison of the 10 genomes.
^c Difference in the numbers of genes upstream and downstream of the predicted gene cluster compared with the experimentally characterized cluster. The minus and plus quantities indicate under- and over-predictions, respectively.
^d Error is defined as the sum of absolute values of the differences in gene number at both ends of a predicted gene cluster.
^e Epipolythiodioxopiperazine-type toxin.

Figure 2. Analysis of the average prediction errors. Averages of the minimum error for predicting the known gene clusters (solid line) of A. flavus and A. fumigatus are shown together with the total numbers of genes in the predicted gene clusters of A. flavus (dotted line) and A. fumigatus (broken line). The panels show the E-value threshold (Param1), the gap and mismatch penalty (Param2) and the negative gene penalty (Param4). The default values, except for the parameter indicated in each panel, were the same as those used in Fig. 3.

Figure 3. Prediction of known gene clusters. The numbers of known gene clusters that were predicted within the minimum error of 10 genes were analyzed by varying three parameters, Param1 (E-value threshold), Param2 (gap and mismatch penalty) and Param4 (negative gene penalty), one by one in a, b and c, respectively. The legend distinguishes PKS, NRPS, PKS-NRPS hybrid and other clusters. The tentative default values, except for the parameter indicated in each panel, were Param1 = e-10, Param2 = -0.2, Param3 = 35 genes, Param4 = -0.3, Param5 = 10 kb and Param6 = 25%.

Considering the accuracy of detecting the known SMB gene clusters described above, the predicted unknown gene clusters without core genes are highly likely to also be involved in SMB (see Supplementary Tables S2-S5 for complete lists of predicted clusters from A. nidulans, A. fumigatus, A. flavus and A. oryzae). To further evaluate the probability of a relationship with SMB, the content of genes in the Q (secondary metabolism) functional category of the euKaryotic Orthologous Groups (KOG) was analyzed. The ratios of the Q genes in the predicted clusters and in the remainder of the genes on nonsyntenic blocks from the A. fumigatus genome were 119/1,038 and 100/2,297, respectively. Subsequent statistical analysis of this result indicated significant enrichment for Q genes in the predicted clusters, which strongly suggested that the predicted unknown gene clusters were related to SMB regardless of the existence of core genes in the cluster.
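
One way to check such an enrichment is a one-sided hypergeometric (Fisher-type) test, sketched below in Python. The text does not name the test that was used, so the choice of test is an assumption, as is the reading of the two ratios as Q genes over all genes in each group. The function name `hypergeom_upper_tail` is hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k: int, n_drawn: int, n_success: int, n_total: int) -> float:
    """P(X >= k) when n_drawn genes are drawn from n_total genes,
    of which n_success belong to the category of interest."""
    upper = min(n_drawn, n_success)
    favourable = sum(
        comb(n_success, x) * comb(n_total - n_success, n_drawn - x)
        for x in range(k, upper + 1)
    )
    # Exact rational arithmetic avoids underflow before the final conversion.
    return float(Fraction(favourable, comb(n_total, n_drawn)))


# Counts quoted in the text, under the assumed reading described above.
q_in, total_in = 119, 1038    # Q genes / genes in predicted clusters
q_out, total_out = 100, 2297  # Q genes / remaining genes on nonsyntenic blocks

p_value = hypergeom_upper_tail(q_in, total_in, q_in + q_out, total_in + total_out)
print(f"Q fraction inside: {q_in / total_in:.3f}, outside: {q_out / total_out:.3f}")
print(f"one-sided P-value: {p_value:.3g}")
```

Under these assumptions the Q fraction inside the predicted clusters is more than twice that of the remaining nonsyntenic genes.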

Interestingly, some known gene clusters were detected by comparison with a gene cluster that appeared to have little relationship except for the core structure of the products, such as a polyketide or a nonribosomal peptide. For example, the A. nidulans asperfuranone biosynthesis and A. terreus lovastatin biosynthesis gene clusters (Fig. 5, Table 2) consisted of genes annotated as PKS, oxidoreductase and a transporter (Table 2). These gene clusters were aligned in the forward and reverse directions to create a seed (Fig. 5).

3.3. Properties of secondary metabolism genes 

We devised a comparative genomics method for predicting SMB gene clusters by effectively utilizing the rapidly growing accumulation of genome sequences. In this study, we have successfully identified the known SMB gene clusters with a high probability (21 out of 24 clusters). The results indicate overall similarities in the amino acid sequences and/or the order of member genes between the pairs of gene clusters, including those involved in the biosynthesis of A. nidulans asperfuranone and A. terreus lovastatin, A. fumigatus fumitremorgin and F. verticillioides fumonisin, and A. fumigatus melanin and F. graminearum aurofusarin.

Figure 4. Prediction of Aspergillus fumigatus gene clusters. The number of predicted gene clusters (Y axis) in the indicated size range (X axis; gene cluster size as the number of genes) was analyzed by varying Param1, Param2 and Param4 using the results of A. fumigatus as an example. The default values, except for the parameter indicated in each panel, were the same as those used in Fig. 3.

Secondary metabolism genes are highly enriched in the nonsyntenic blocks in a comparison of the genomes of three Aspergillus species. We have applied this observation to our method and have successfully identified various known SMB gene clusters from the genomes of the 10 fungal species, including those outside the genus Aspergillus. This result indicates that the high enrichment of secondary metabolism genes in nonsyntenic blocks is conserved among at least the 10 fungal species used in this study. However, SMB gene clusters producing common products in phylogenetically close species may often be syntenic, as previously shown in various reports, which has resulted in the failure of detection by comparisons of the corresponding clusters in the respective genomes. Typical examples of unsuccessful detections involved the combinations of SMB gene clusters for A. flavus aflatoxin and A. nidulans sterigmatocystin and the F. verticillioides and F. oxysporum bikaverin cluster homologs. Similarly, horizontal transfer of a gene cluster may also result in unsuccessful detection of the cluster, even between species with large phylogenetic distances. An SMB gene cluster is known to consist of genes encoding proteins with particular characteristic functions, such as PKSs, NRPSs, Zn(II)2Cys6 transcription factors and major facilitator superfamily (MFS) transporters. Significant enrichment of these genes allowed the identification of SMB gene clusters, owing to the overall similarity among various clusters producing different compounds and between species with large phylogenetic distances. Our method, which first detected seeds by local gene alignments and successively corrected their boundaries using simple similarity searches independent of synteny, identified SMB gene clusters more efficiently than expected prior to this study, even though nonsyntenic blocks are known to have high diversity.

The previously reported methods predicted SMB gene clusters based on the sequence similarity of the core genes in the cluster, such as NRPS, PKS, a hybrid NRPS-PKS enzyme and DMAT. In contrast to these methods, our method does not depend on the presence of core genes. Owing to this feature, the A. oryzae kojic acid biosynthesis gene cluster, which does not include core genes, was successfully predicted using this method. In contrast, there are also examples of missing predictions of known short SMB gene clusters, such as those responsible for the biosynthesis of asperthecin in A. nidulans and fusaric acid in F. verticillioides (Table 1). The inability to identify the gene clusters named above was due to the existence of an inversion in the former cluster and unique genes in the latter one, resulting in the failure of the local alignment of homologous gene pairs. In both cases, the short sizes of the clusters (three to five genes) prevented the remaining portions of the clusters from being identified. Similarly, interruption of a cluster by the horizontal transfer of another cluster (more than five genes), dividing the cluster into small segments, may also cause detection failure.

Figure 5. Schematic drawing of an example of a predicted known SMB gene cluster. The top figures represent seeds used in the detection of a pair of SMB gene clusters for A. nidulans asperfuranone and A. terreus lovastatin. The left and the right panels show the alignments in the forward and reverse directions, respectively. The bottom figure shows all of the homologous gene pairs included between the two clusters (part of the asperfuranone biosynthesis gene cluster of Aspergillus nidulans and part of the lovastatin biosynthesis gene cluster of Aspergillus terreus). No orthologs were identified in this pair of gene clusters.

Table 2. Examples of member genes in a predicted gene cluster

| GID | Protein^a | Predicted function | GID | Protein^a | Predicted function | E-value^b |
|---|---|---|---|---|---|---|
| ANID_01030 | 406 aa | Zinc-binding oxidoreductase | ATEG_09963 | 364 aa | hypothetical protein similar to enoyl reductase | 2.00E-18 |
| ANID_01031 | 564 aa | MFS transporter | ATEG_09967 | 543 aa | hypothetical protein similar to efflux pump | 7.00E-97 |
| ANID_01032 | 298 aa | Conserved hypothetical protein | ATEG_09962 | 257 aa | hypothetical protein similar to oxidoreductase | 7.00E-12 |
| ANID_01034 | 2723 aa | Polyketide synthase | ATEG_09961 | 3005 aa | hypothetical protein similar to polyketide synthase | 6.00E-57 |
| ANID_01034 | 2723 aa | Polyketide synthase | ATEG_09968 | 2453 aa | hypothetical protein similar to polyketide synthase | 5.00E-34 |
| ANID_01036 | 2528 aa | Polyketide synthase | ATEG_09968 | 2543 aa | hypothetical protein similar to polyketide synthase | 0.00E+00 |
| ANID_01036 | 2528 aa | Polyketide synthase | ATEG_09961 | 3005 aa | hypothetical protein similar to polyketide synthase | 0.00E+00 |

^a Length of the polypeptide in amino acids.
^b E-value of the similarity between the proteins of the detected gene clusters.

In this study, many known SMB gene clusters were identified within an error of 19 genes for the cluster with the maximum CB score, where error is defined as the sum of the absolute differences of the cluster margins at both ends. Because our method does not depend on gene order within the length of the cluster size for the correction of cluster boundaries, this observation strongly suggests that the genes inside and outside of the clusters have different sequence characteristics. Accordingly, the probability of homology between the genes inside the clusters from the two genomes is significantly higher than (i) the probability of homology between the genes outside the clusters or (ii) the probability of homology between the genes inside the clusters and the genes outside the clusters ($P = 6.2 \times 10^{-121}$). In contrast, the cluster sizes of some gene clusters, e.g. the kojic acid biosynthesis gene cluster, were overestimated. This overestimation suggests two possibilities: (i) two clusters were located side by side with few or no non-SMB genes in between or (ii) the gene cluster may be a part of an ancestral SMB gene cluster, with the remainder of the genes being presently inactive. In the cases of clusters with errors larger than 10 at the maximum CB value, such as the clusters for aflatoxin, terrequinone and pseurotin biosynthesis as well as kojic acid biosynthesis in Table 1, genes with characteristics of SMB genes were identified beyond the experimentally determined cluster margins.

As described above, our method is a useful means of predicting SMB gene clusters, particularly novel clusters without core genes; thus, this method has the potential to discover novel mechanisms of unknown SMBs. Two major problems of our method are that short gene clusters might not be detected in some cases and that the prediction of a cluster boundary might not always be accurate. Recently, a method for predicting accurate margins of SMB gene clusters by analyzing the co-expression of neighboring genes has been reported, with the condition that the gene indispensable for the SMB gene cluster is identified using the known sequence of the core gene. Therefore, the combination of our method and the expression analysis method described above could effectively compensate for the problems that currently exist in both methods. However, our method is essentially not applicable to SMB gene clusters that are unique to particular genomes. Nevertheless, of the 24 known SMB gene clusters in the 10 genomes used in this study, 21 clusters were identified via comprehensive pairwise comparisons. The problem of detecting 'rare SMB gene clusters' could be solved by increasing the number of genomes used for predictions. The acceleration of sequence accumulation due to the rapid development of sequencing technologies is expected to significantly increase the performance of our method in a short period of time. A comprehensive analysis of the distribution of secondary metabolism genes and motifs in translated polypeptides across diverse species, together with structural analyses of the corresponding compounds, will open a new era in the study of secondary metabolism.

3.4. Availability 

We intend to provide the present method as a web service (http://www.fung-metb.net/).